A Reinforcement Learning Method for Maximizing Undiscounted Rewards

Author

  • Anton Schwartz
Abstract

While most Reinforcement Learning work utilizes temporal discounting to evaluate performance, the reasons for this are unclear. Is it out of desire or necessity? We argue that it is not out of desire, and seek to dispel the notion that temporal discounting is necessary by proposing a framework for undiscounted optimization. We present a metric of undiscounted performance and an algorithm for finding action policies that maximize that measure. The technique, which we call R-learning, is modelled after the popular Q-learning algorithm [17]. Initial experimental results are presented which attest to a great improvement over Q-learning in some simple cases.
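
The abstract names R-learning as an average-reward analogue of Q-learning but does not reproduce its update rule; the following is a minimal tabular sketch in Python of that style of undiscounted update. The environment interface (reset, step, actions), the step sizes, and the exploration rate are illustrative assumptions, not details taken from the paper.

  import random
  from collections import defaultdict

  def r_learning(env, n_steps=100_000, alpha=0.01, beta=0.1, epsilon=0.1):
      # Sketch of an R-learning-style update (after Schwartz, 1993): a tabular,
      # undiscounted counterpart of Q-learning for continuing tasks.
      # `env` is a hypothetical interface: reset() -> state,
      # step(action) -> (next_state, reward), and a discrete set env.actions.
      R = defaultdict(float)   # relative action values R(s, a)
      rho = 0.0                # running estimate of the average reward per step
      actions = list(env.actions)

      s = env.reset()
      for _ in range(n_steps):
          # epsilon-greedy behaviour over the relative values
          if random.random() < epsilon:
              a = random.choice(actions)
          else:
              a = max(actions, key=lambda x: R[(s, x)])

          best_here = max(R[(s, x)] for x in actions)
          was_greedy = (R[(s, a)] == best_here)

          s_next, r = env.step(a)
          best_next = max(R[(s_next, x)] for x in actions)

          # value update: reward is measured relative to rho; no discount factor
          R[(s, a)] += beta * (r - rho + best_next - R[(s, a)])

          # the average-reward estimate is adjusted only on greedy steps
          if was_greedy:
              rho += alpha * (r - rho + best_next - best_here)

          s = s_next
      return R, rho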


Related articles

Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Besides the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transitio...
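
For reference, the Poisson equation mentioned above is the average-reward optimality equation; in standard notation (assumed here, not quoted from the cited abstract) it reads

  \[
    \rho^{*} + h^{*}(s) \;=\; \max_{a \in \mathcal{A}} \Big( r(s,a) + \int_{\mathcal{S}} h^{*}(s')\, p(ds' \mid s, a) \Big),
  \]

where \rho^{*} is the optimal average reward and h^{*} is the bias (relative value) function.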


An Analysis of Direct Reinforcement Learning in Non-Markovian Domains

It is well known that for Markov decision processes, the policies stable under policy iteration and the standard reinforcement learning methods are exactly the optimal policies. In this paper, we investigate the conditions for policy stability in the more general situation when the Markov property cannot be assumed. We show that for a general class of non-Markov decision processes, if actual re...


An Analysis of non-Markov Automata Games: Implications for Reinforcement Learning

It has previously been established that for Markov learning automata games, the game equilibria are exactly the optimal strategies (Witten, 1977; Wheeler & Narendra, 1986). In this paper, we extend the game theoretic view of reinforcement learning to consider the implications for "group rationality" (Wheeler & Narendra, 1986) in the more general situation of learning when the Markov property ca...


An Average-Reward Reinforcement Learning

Recently, there has been growing interest in average-reward reinforcement learning (ARL), an undiscounted optimality framework that is applicable to many different control tasks. ARL seeks to compute gain-optimal control policies that maximize the expected payoff per step. However, gain-optimality has some intrinsic limitations as an optimality criterion, since for example, it cannot distinguish ...
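
In standard notation (assumed here, not taken from the cited abstract), the gain that ARL maximizes is the expected payoff per step,

  \[
    \rho^{\pi}(s) \;=\; \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}^{\pi}\!\Big[ \sum_{t=1}^{N} r_{t} \,\Big|\, s_{0} = s \Big],
  \]

and a policy is gain-optimal if it attains the maximal \rho^{\pi}(s) in every state; since the limit discards any finite prefix of rewards, gain-optimality is insensitive to transient behaviour.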


Improved Regret Bounds for Undiscounted Continuous Reinforcement Learning

We consider the problem of undiscounted reinforcement learning in continuous state space. Regret bounds in this setting usually hold under various assumptions on the structure of the reward and transition function. Under the assumption that the rewards and transition probabilities are Lipschitz, for 1-dimensional state space a regret bound of Õ(T^{3/4}) after any T steps has been given by Ortner...




Publication date: 1993